Abstract

We consider stochastic convex optimization with a strongly convex (but not necessarily smooth) objective. We give an algorithm which performs only gradient updates with optimal rate of convergence.

1 Setup 

Consider the problem of minimizing a convex function $f$ on a convex domain $\mathcal{K}$:

$$\min_{x \in \mathcal{K}} f(x).$$

Assume that we have an upper bound on the values of $f$, i.e. a number $M > 0$ such that for any $x_1, x_2 \in \mathcal{K}$, we have $|f(x_1) - f(x_2)| \le M$. Also, assume we can compute an unbiased estimator of a subgradient of $f$ at any point $x$, with $L_2$ norm bounded by some known value $G$. Assume that the domain $\mathcal{K}$ is endowed with a projection operator $\Pi_{\mathcal{K}}(y) = \arg\min_{x \in \mathcal{K}} \|x - y\|$. Finally, we assume that $f$ satisfies the following inequality:

$$f(x) - f(x^*) \ge \lambda \|x - x^*\|^2,$$

where $x^*$ is the point in $\mathcal{K}$ on which $f$ is minimized. This property holds, for example, if $f$ is $\lambda$-strongly-convex. The canonical example of such an optimization problem is support-vector-machine training.

2 The algorithm

The algorithm is a straightforward extension of stochastic gradient descent. The new feature is the introduction of "epochs" inside of which standard stochastic gradient descent is used, but in each consecutive epoch the learning rate decreases exponentially.

3 Analysis

Our main result is the following theorem.
Algorithm 1 Epoch-GD

1: Input: parameters $M$, $G$, $\lambda$, and error tolerance $\varepsilon$. Initialize $x_1^1 \in \mathcal{K}$ arbitrarily.
2: for $k = 1$ to $\lceil \log_2 \frac{M}{\varepsilon} \rceil$ do
3:   Set $V_k = \frac{M}{2^{k-1}}$, and start an epoch as follows
4:   for $t = 1$ to $T_k$ do
5:     Let the estimated subgradient of $f$ at $x_t^k$ be $\hat{g}_t$
6:     Update $x_{t+1}^k = \Pi_{\mathcal{K}}\{x_t^k - \eta_k \hat{g}_t\}$
7:   end for
8:   Set $x_1^{k+1} = \frac{1}{T_k} \sum_{t=1}^{T_k} x_t^k$ (or pick an iterate at random).
9: end for
10: return $x_1^{k+1}$.
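The loop structure of Algorithm 1 can be sketched in code. The following is a minimal Python sketch rather than the authors' implementation: it treats a one-dimensional domain for brevity, it assumes the parameter choices $T_k = \lceil 16G^2/(\lambda V_k) \rceil$ and $\eta_k = V_k/(4G^2)$ of Theorem 3.1, and the names `epoch_gd`, `subgrad_estimate`, `project` and `project_interval` are hypothetical.

```python
import math
import random


def project_interval(x, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return min(max(x, lo), hi)


def epoch_gd(subgrad_estimate, project, x_init, M, G, lam, eps):
    """Sketch of Epoch-GD for a scalar variable.

    subgrad_estimate(x): unbiased subgradient estimate, |value| <= G (assumed).
    project(x): projection onto the domain K (assumed).
    Returns the final point and the total number of gradient updates.
    """
    num_epochs = math.ceil(math.log2(M / eps))
    x_start = x_init
    total_updates = 0
    for k in range(1, num_epochs + 1):
        V_k = M / 2 ** (k - 1)                      # target error of epoch k
        T_k = math.ceil(16 * G ** 2 / (lam * V_k))  # epoch length
        eta_k = V_k / (4 * G ** 2)                  # learning rate of epoch k
        x = x_start
        running_sum = 0.0
        for _ in range(T_k):
            running_sum += x                        # accumulate x_1, ..., x_{T_k}
            x = project(x - eta_k * subgrad_estimate(x))
        x_start = running_sum / T_k                 # average iterate starts next epoch
        total_updates += T_k
    return x_start, total_updates
```

As a toy instance, take $f(x) = x^2$ on $\mathcal{K} = [-1, 1]$ (so $\lambda = 1$ and $M = 1$) with gradient estimates $2x$ plus uniform noise in $[-1, 1]$ (so $G = 3$). With $\varepsilon = 0.01$ the sketch runs 7 epochs and performs 18288 updates in total.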



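Theorem 3.1. The final point $x_1^{k+1}$ returned by the Epoch-GD algorithm, with parameters $T_k = \lceil \frac{16G^2}{\lambda V_k} \rceil$ and $\eta_k = \frac{V_k}{4G^2}$, has the property that $E[f(x_1^{k+1})] - f(x^*) \le \varepsilon$. The total number of gradient updates is $O\left(\frac{G^2}{\lambda \varepsilon}\right)$.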



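The use of standard gradient descent within each epoch is analyzed using the following lemma from Zinkevich:

Lemma 3.2 (Zinkevich). Let $D = \|x^* - x_1\|_2$ and $\|\hat{g}_t\| \le G$. Apply $T$ iterations of the update $x_{t+1} = \Pi_{\mathcal{K}}\{x_t - \eta \hat{g}_t\}$. Then

$$\frac{1}{T} \sum_{t=1}^{T} \hat{g}_t \cdot (x_t - x^*) \le \eta G^2 + \frac{D^2}{\eta T}.$$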

Now, if we set $\hat{g}_t$ to be the unbiased estimator of a subgradient $g_t$ of $f$ at $x_t$, then by the convexity of $f$, we get

$$f(x_t) - f(x^*) \le g_t \cdot (x_t - x^*) = E_{t-1}[\hat{g}_t \cdot (x_t - x^*)],$$

where $E_{t-1}[\cdot]$ denotes expectation conditioned on all the randomness up to round $t - 1$. This immediately implies the following:

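Lemma 3.3. Let $D = \|x^* - x_1\|_2$. Apply $T$ iterations of the update $x_{t+1} = \Pi_{\mathcal{K}}\{x_t - \eta \hat{g}_t\}$, where $\hat{g}_t$ is an unbiased estimator for the (sub)gradient of $f$ at $x_t$ satisfying $\|\hat{g}_t\| \le G$. Then

$$\frac{1}{T} E\left[\sum_{t=1}^{T} f(x_t)\right] - f(x^*) \le \eta G^2 + \frac{D^2}{\eta T}.$$

By convexity of $f$, we have the same bound for $E[f(\bar{x})] - f(x^*)$, where $\bar{x} = \frac{1}{T} \sum_{t=1}^{T} x_t$.

Define $\Delta_k = f(x_1^k) - f(x^*)$. Using Lemma 3.3, we prove the following key lemma:

Lemma 3.4. For any $k$, we have $E[\Delta_k] \le V_k$.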






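Proof. We prove this by induction on $k$. The claim is true for $k = 1$ since $\Delta_1 \le M$. Assume that $E[\Delta_k] \le V_k$ for some $k \ge 1$ and now we prove it for $k + 1$. For a random variable $X$ measurable w.r.t. the randomness defined up to epoch $k + 1$, let $E_k[X]$ denote its expectation conditioned on all the randomness up to epoch $k$. By Lemma 3.3 we have

$$E_k[f(x_1^{k+1})] - f(x^*) \le \eta_k G^2 + \frac{\|x_1^k - x^*\|^2}{\eta_k T_k} \le \eta_k G^2 + \frac{\Delta_k}{\eta_k T_k \lambda} \quad \text{(by } \lambda\text{-strong convexity)},$$

and hence,

$$E[\Delta_{k+1}] \le \eta_k G^2 + \frac{E[\Delta_k]}{\eta_k T_k \lambda} \le \eta_k G^2 + \frac{V_k}{\eta_k T_k \lambda} \le \frac{V_k}{2} = V_{k+1},$$

as required. The second inequality uses the induction hypothesis, and the last inequality and equality use the definition of $V_k$ and the values $\eta_k = \frac{V_k}{4G^2}$ and $T_k = \lceil \frac{16G^2}{\lambda V_k} \rceil$.

□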
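The last inequality in the proof of Lemma 3.4 can be checked term by term. The computation below is a short sketch that assumes only the parameter values $\eta_k = V_k/(4G^2)$ and $T_k \ge 16G^2/(\lambda V_k)$ from Theorem 3.1 and the definition $V_{k+1} = V_k/2$.

```latex
\begin{align*}
\eta_k G^2 &= \frac{V_k}{4G^2} \cdot G^2 = \frac{V_k}{4}, \\
\frac{V_k}{\eta_k T_k \lambda}
  &\le \frac{V_k}{\lambda} \cdot \frac{4G^2}{V_k} \cdot \frac{\lambda V_k}{16G^2}
   = \frac{V_k}{4}, \\
\eta_k G^2 + \frac{V_k}{\eta_k T_k \lambda}
  &\le \frac{V_k}{4} + \frac{V_k}{4} = \frac{V_k}{2} = V_{k+1}.
\end{align*}
```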



We can now prove our main theorem:

Proof of Theorem 3.1. By the previous claim, taking $k = \lceil \log_2 \frac{M}{\varepsilon} \rceil$ we have

$$E[f(x_1^{k+1})] - f(x^*) = E[\Delta_{k+1}] \le V_{k+1} = \frac{M}{2^k} \le \varepsilon,$$

as claimed.

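To compute the total number of gradient updates, we sum up along the epochs: in each epoch $k$ we have $T_k = \lceil \frac{16G^2}{\lambda V_k} \rceil$ gradient updates, for a total of

$$\sum_{k=1}^{\lceil \log_2 \frac{M}{\varepsilon} \rceil} \left\lceil \frac{16G^2}{\lambda V_k} \right\rceil \le \sum_{k=1}^{\lceil \log_2 \frac{M}{\varepsilon} \rceil} \left( \frac{16G^2 \cdot 2^{k-1}}{\lambda M} + 1 \right) = O\left(\frac{G^2}{\lambda \varepsilon}\right),$$

assuming that $\lceil \log_2 \frac{M}{\varepsilon} \rceil = O\left(\frac{G^2}{\lambda \varepsilon}\right)$.

□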
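The epoch lengths grow geometrically, so the last epoch dominates the count. Since $\sum_k 2^{k-1} < 2M/\varepsilon$ over the $\lceil \log_2 (M/\varepsilon) \rceil$ epochs, the total is at most $32G^2/(\lambda\varepsilon)$ plus the number of epochs. The Python sketch below tallies the updates for concrete values; the function name `total_updates` is hypothetical and the parameter values in the usage note are purely illustrative.

```python
import math


def total_updates(M, G, lam, eps):
    """Sum the epoch lengths T_k = ceil(16 G^2 / (lam * V_k)) over all epochs.

    Returns (total number of gradient updates, number of epochs).
    """
    num_epochs = math.ceil(math.log2(M / eps))
    total = 0
    for k in range(1, num_epochs + 1):
        V_k = M / 2 ** (k - 1)
        total += math.ceil(16 * G ** 2 / (lam * V_k))
    return total, num_epochs
```

With $M = G = \lambda = 1$ and $\varepsilon = 0.1$ there are 4 epochs of lengths 16, 32, 64 and 128, so `total_updates(1.0, 1.0, 1.0, 0.1)` returns `(240, 4)`, which is 24 times $G^2/(\lambda\varepsilon) = 10$.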



4 Conclusions 

Extensions of the above result to stochastic optimization of strongly convex functions with respect to norms other than the Euclidean norm are straightforward via standard online learning techniques. A factor two speedup can be obtained by stopping the epoch at a random point.

We thank Nati Srebro for bringing the problem of deriving an efficient algorithm for stochastic strongly-convex optimization to our attention.
A High probability bounds

We briefly sketch how, using essentially the same algorithm with slightly more iterations, we can get a high probability guarantee on the quality of the solution. The update in line 6 requires a projection onto a smaller set, and becomes



6: Update $x_{t+1}^k = \Pi_{\mathcal{K} \cap B_{\sqrt{V_k/\lambda}}(x_1^k)}\{x_t^k - \eta_k \hat{g}_t\}$



Here $B_r(x)$ denotes the $L_2$ ball of radius $r$ around the point $x$. We assume that such a projection can be computed very efficiently. In particular, if $\mathcal{K} = \mathbb{R}^n$, then the projection is simply a scaling down of the vector towards the center of the ball.
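For the unconstrained case $\mathcal{K} = \mathbb{R}^n$, the scaling described above can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `project_onto_ball` and the list-based vector representation are assumptions, not part of the algorithm's specification.

```python
import math


def project_onto_ball(y, center, radius):
    """Euclidean projection of the point y onto the L2 ball B_radius(center)."""
    offset = [yi - ci for yi, ci in zip(y, center)]
    dist = math.sqrt(sum(d * d for d in offset))
    if dist <= radius:
        return list(y)  # y already lies in the ball
    scale = radius / dist
    # shrink the offset so the result lies on the sphere of the given radius
    return [ci + scale * d for ci, d in zip(center, offset)]
```

A point inside the ball is returned unchanged; a point outside is moved along the segment to the center until it reaches the boundary.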
We prove: 

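Theorem A.1. The final point $x_1^{k+1}$ returned by the modified Epoch-GD algorithm, with parameters $T_k = \lceil \frac{100 G^2 \ln(1/\tilde{\delta})}{\lambda V_k} \rceil$ and $\eta_k = \frac{V_k}{10G^2}$, where $\tilde{\delta} = \frac{\delta}{\lceil \log_2 (M/\varepsilon) \rceil}$, has the property that $f(x_1^{k+1}) - f(x^*) \le \varepsilon$ with probability at least $1 - \delta$. The total number of gradient updates is $O\left(\frac{G^2 \log(1/\tilde{\delta})}{\lambda \varepsilon}\right)$.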



The following lemma is analogous to Lemma 3.3, but provides a high probability guarantee.

Lemma A.2. Let $D = \|x^* - x_1\|_2$. Apply $T$ iterations of the update $x_{t+1} = \Pi_{\mathcal{K} \cap B_D(x_1)}\{x_t - \eta \hat{g}_t\}$, where $\hat{g}_t$ is an unbiased estimator for the (sub)gradient of $f$ at $x_t$ satisfying $\|\hat{g}_t\| \le G$. Then with probability at least $1 - \delta$,

$$\frac{1}{T} \sum_{t=1}^{T} f(x_t) - f(x^*) \le \eta G^2 + \frac{D^2}{\eta T} + \frac{8GD\sqrt{\ln(1/\delta)}}{\sqrt{T}}.$$

By the convexity of $f$, the same bound holds for $f(\bar{x}) - f(x^*)$, where $\bar{x} = \frac{1}{T} \sum_{t=1}^{T} x_t$.
Proof. Let $g_t = E_{t-1}[\hat{g}_t]$, a subgradient of $f$ at $x_t$, where $E_{t-1}[\cdot]$ denotes the expectation conditioned on all randomness up to round $t - 1$. Consider the martingale difference sequence given by

$$X_t = g_t \cdot (x_t - x^*) - \hat{g}_t \cdot (x_t - x^*).$$

We can bound $|X_t|$ as follows:

$$|X_t| \le \|\hat{g}_t\| \|x_t - x^*\| + E_{t-1}[\|\hat{g}_t\|] \|x_t - x^*\| \le 4GD,$$

where the last inequality uses the fact that $x_t \in B_D(x_1)$, and hence by the triangle inequality $\|x_t - x^*\| \le \|x_t - x_1\| + \|x_1 - x^*\| \le 2D$.

By Azuma's inequality (see Lemma A.4), with probability at least $1 - \delta$, the following holds:

$$\frac{1}{T} \sum_{t=1}^{T} g_t \cdot (x_t - x^*) - \frac{1}{T} \sum_{t=1}^{T} \hat{g}_t \cdot (x_t - x^*) \le \frac{8GD\sqrt{\ln(1/\delta)}}{\sqrt{T}}. \quad (1)$$

Note that by the convexity of $f$, we have $f(x_t) - f(x^*) \le g_t \cdot (x_t - x^*)$. Then, by using Lemma 3.2 and inequality (1), we get the claimed bound. □
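The constant in inequality (1) can be traced back to Azuma's inequality. The following is a short sketch of that step, applying Lemma A.4 with $b = 4GD$ to the sequence $X_1, \ldots, X_T$; both lines hold with probability at least $1 - \delta$.

```latex
\begin{align*}
\sum_{t=1}^{T} X_t
  &\le \sqrt{2 (4GD)^2 \, T \ln(1/\delta)}
   = 4\sqrt{2}\, GD \sqrt{T \ln(1/\delta)}
   \le 8 GD \sqrt{T \ln(1/\delta)}, \\
\frac{1}{T} \sum_{t=1}^{T} X_t
  &\le \frac{8 GD \sqrt{\ln(1/\delta)}}{\sqrt{T}}.
\end{align*}
```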






We can now proceed along the same lines as Theorem 3.1 and prove the same result with high probability; the derivation is completely analogous.



Lemma A.3. For an appropriate choice of $\eta_k, T_k$, the following holds. For any $k$, with probability at least $(1 - \tilde{\delta})^{k-1}$ we have $\Delta_k \le V_k$.

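Proof. We prove this by induction on $k$. The claim is true for $k = 1$ since $\Delta_1 \le M$. Assume that $\Delta_k \le V_k$ for some $k \ge 1$ with probability at least $(1 - \tilde{\delta})^{k-1}$ and now we prove it for $k + 1$. We condition on the event that $\Delta_k \le V_k$. By Lemma A.2, we have with probability at least $1 - \tilde{\delta}$,

$$\Delta_{k+1} = f(x_1^{k+1}) - f(x^*)$$

$$\le \eta_k G^2 + \frac{\|x_1^k - x^*\|^2}{\eta_k T_k} + \frac{8G \|x_1^k - x^*\| \sqrt{\ln(1/\tilde{\delta})}}{\sqrt{T_k}} \quad \text{(by Lemma A.2)}$$

$$\le \eta_k G^2 + \frac{\Delta_k}{\eta_k T_k \lambda} + \frac{8G \sqrt{\Delta_k} \sqrt{\ln(1/\tilde{\delta})}}{\sqrt{\lambda T_k}} \quad \text{(by } \lambda\text{-strong convexity)}$$

$$\le \eta_k G^2 + \frac{V_k}{\eta_k T_k \lambda} + \frac{8G \sqrt{V_k} \sqrt{\ln(1/\tilde{\delta})}}{\sqrt{\lambda T_k}} \quad \text{(by the conditioning)}$$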



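Let $T_k = \lceil \frac{100 G^2 \ln(1/\tilde{\delta})}{\lambda V_k} \rceil$, and we get

$$f(x_1^{k+1}) - f(x^*) \le \eta_k G^2 + \frac{V_k^2}{100\, \eta_k G^2 \ln(1/\tilde{\delta})} + \frac{8 V_k}{10}.$$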



Next, set $\eta_k = \frac{V_k}{10G^2}$, and we get that

^ < 

10 101n(l/5) 10 ~ 2 



A.+i = /(x^i) - /(X*) < ^ + — ^ + ^ < ^ = Vk^i. 



Factoring in the conditioned event, which happens with probability at least $(1 - \tilde{\delta})^{k-1}$, overall, we get that $\Delta_{k+1} \le V_{k+1}$ with probability at least $(1 - \tilde{\delta})^{k}$. □



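We can now prove our high probability theorem:

Proof of Theorem A.1. By the previous claim, taking $k = \lceil \log_2 \frac{M}{\varepsilon} \rceil$ we have with probability at least $(1 - \tilde{\delta})^k$ that

$$f(x_1^{k+1}) - f(x^*) = \Delta_{k+1} \le V_{k+1} = \frac{M}{2^k} \le \varepsilon.$$

Since $\tilde{\delta} = \frac{\delta}{k}$, we have $(1 - \tilde{\delta})^k \ge 1 - \delta$, as needed.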
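The probability bound in the last step follows from Bernoulli's inequality. As a one-line sketch, assuming $\tilde{\delta} = \delta/k$ with $0 \le \delta \le 1$:

```latex
\[
(1 - \tilde{\delta})^k = \Bigl(1 - \frac{\delta}{k}\Bigr)^k
  \ge 1 - k \cdot \frac{\delta}{k} = 1 - \delta .
\]
```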
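To compute the total number of gradient updates, we sum up along the epochs: in each epoch $k$ we have $T_k = O\left(\frac{G^2 \log(1/\tilde{\delta})}{\lambda V_k}\right)$ gradient updates, for a total of

$$\sum_{k=1}^{\lceil \log_2 \frac{M}{\varepsilon} \rceil} O\left(\frac{G^2 \log(1/\tilde{\delta})}{\lambda V_k}\right) = \sum_{k=1}^{\lceil \log_2 \frac{M}{\varepsilon} \rceil} O\left(\frac{G^2 \log(1/\tilde{\delta}) \cdot 2^{k-1}}{\lambda M}\right) = O\left(\frac{G^2 \log(1/\tilde{\delta})}{\lambda \varepsilon}\right).$$

□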



A.1 Martingale concentration lemma

The following inequality is standard in obtaining high probability regret bounds:

Lemma A.4 (Azuma's inequality). Let $X_1, \ldots, X_T$ be a martingale difference sequence. Suppose that $|X_t| \le b$. Then, for $\delta > 0$, we have

$$\Pr\left[\sum_{t=1}^{T} X_t > \sqrt{2 b^2 T \ln(1/\delta)}\right] \le \delta.$$
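Azuma's inequality can be checked numerically on the simplest martingale difference sequence, independent steps of $+b$ or $-b$ with equal probability. The Python sketch below is only an illustration under that assumption; the function names `azuma_threshold` and `empirical_exceedance` are hypothetical.

```python
import math
import random


def azuma_threshold(b, T, delta):
    """Deviation level sqrt(2 b^2 T ln(1/delta)) from Lemma A.4."""
    return math.sqrt(2 * b ** 2 * T * math.log(1 / delta))


def empirical_exceedance(b, T, delta, trials, seed=0):
    """Fraction of simulated +/-b random walks whose sum exceeds the threshold."""
    rng = random.Random(seed)
    threshold = azuma_threshold(b, T, delta)
    exceed = 0
    for _ in range(trials):
        # i.i.d. symmetric steps form a martingale difference sequence with |X_t| <= b
        total = sum(b if rng.random() < 0.5 else -b for _ in range(T))
        if total > threshold:
            exceed += 1
    return exceed / trials
```

For example, `empirical_exceedance(1.0, 100, 0.05, 2000)` estimates the probability that a 100-step walk exceeds the threshold for $\delta = 0.05$; the lemma says the true probability is at most 0.05, and for this symmetric walk the bound is far from tight.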